LEAD SCORE ANALYSIS¶
Case Study - Summary¶
1. Company:¶
- X Education (sells online courses to professionals).
2. Current Situation:¶
- Many leads generated daily (via website visits, ads, social media, referrals).
- Only ~30% of leads convert to paying customers.
- Sales team spends excessive time contacting all leads, regardless of conversion likelihood.
3. Business Goal:¶
- Identify “Hot Leads” (those most likely to convert).
- Build a logistic regression model to assign each lead a score between 0 and 100.
- Higher score → higher probability of conversion.
- CEO aims to raise conversion rate to ~80% by focusing sales efforts on high-scoring leads.
4. Data Given:¶
- ~9000 historical lead records.
- Features include Lead Source, Total Time Spent on Website, Total Visits, Last Activity, etc.
- Target variable: Converted (1 = converted, 0 = not converted).
- Categorical variables contain a “Select” level → treat as missing/null.
5. Expected Results:¶
- A Jupyter notebook with:
- Logistic regression model.
- Lead score predictions.
- Model evaluation metrics (accuracy, precision, recall, ROC-AUC).
- Business insights:
- (a) Top 3 most influential variables overall.
- (b) Top 3 categorical/dummy variables impacting conversion.
- (c) Strategy when interns join → aggressive outreach to medium-score leads.
- (d) Strategy when target is met → reduce outreach to low-score leads to save resources.
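The scoring scheme described above (model probability scaled to a 0–100 lead score) can be sketched as follows; `lead_score` is a hypothetical helper name used for illustration, not part of the assignment code:

```python
# Hypothetical sketch: take the conversion probability predicted by the
# logistic regression model and scale it to a 0-100 lead score
# (higher score = more likely to convert).
def lead_score(conversion_probability):
    """Map a predicted probability in [0, 1] to a lead score in [0, 100]."""
    return round(conversion_probability * 100)

print(lead_score(0.87))  # a lead with an 87% predicted chance scores 87
```

The sales team can then be pointed at leads above a chosen score cutoff, and the cutoff itself can be tuned later against precision/recall.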
In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PowerTransformer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
from sklearn import metrics
from sklearn.metrics import precision_recall_curve
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
Step-1: Reading the Data¶
In [2]:
# Load and Read the dataset
df = pd.read_csv("Leads.csv")
df.head()
Out[2]:
| Prospect ID | Lead Number | Lead Origin | Lead Source | Do Not Email | Do Not Call | Converted | TotalVisits | Total Time Spent on Website | Page Views Per Visit | ... | Get updates on DM Content | Lead Profile | City | Asymmetrique Activity Index | Asymmetrique Profile Index | Asymmetrique Activity Score | Asymmetrique Profile Score | I agree to pay the amount through cheque | A free copy of Mastering The Interview | Last Notable Activity | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7927b2df-8bba-4d29-b9a2-b6e0beafe620 | 660737 | API | Olark Chat | No | No | 0 | 0.0 | 0 | 0.0 | ... | No | Select | Select | 02.Medium | 02.Medium | 15.0 | 15.0 | No | No | Modified |
| 1 | 2a272436-5132-4136-86fa-dcc88c88f482 | 660728 | API | Organic Search | No | No | 0 | 5.0 | 674 | 2.5 | ... | No | Select | Select | 02.Medium | 02.Medium | 15.0 | 15.0 | No | No | Email Opened |
| 2 | 8cc8c611-a219-4f35-ad23-fdfd2656bd8a | 660727 | Landing Page Submission | Direct Traffic | No | No | 1 | 2.0 | 1532 | 2.0 | ... | No | Potential Lead | Mumbai | 02.Medium | 01.High | 14.0 | 20.0 | No | Yes | Email Opened |
| 3 | 0cc2df48-7cf4-4e39-9de9-19797f9b38cc | 660719 | Landing Page Submission | Direct Traffic | No | No | 0 | 1.0 | 305 | 1.0 | ... | No | Select | Mumbai | 02.Medium | 01.High | 13.0 | 17.0 | No | No | Modified |
| 4 | 3256f628-e534-4826-9d63-4a8b88782852 | 660681 | Landing Page Submission | No | No | 1 | 2.0 | 1428 | 1.0 | ... | No | Select | Mumbai | 02.Medium | 01.High | 15.0 | 18.0 | No | No | Modified |
5 rows × 37 columns
In [3]:
# check the shape
df.shape
Out[3]:
(9240, 37)
In [4]:
df_copy = df.copy()
Step-2: Understanding Data¶
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9240 entries, 0 to 9239
Data columns (total 37 columns):
 #   Column                                         Non-Null Count  Dtype
---  ------                                         --------------  -----
 0   Prospect ID                                    9240 non-null   object
 1   Lead Number                                    9240 non-null   int64
 2   Lead Origin                                    9240 non-null   object
 3   Lead Source                                    9204 non-null   object
 4   Do Not Email                                   9240 non-null   object
 5   Do Not Call                                    9240 non-null   object
 6   Converted                                      9240 non-null   int64
 7   TotalVisits                                    9103 non-null   float64
 8   Total Time Spent on Website                    9240 non-null   int64
 9   Page Views Per Visit                           9103 non-null   float64
 10  Last Activity                                  9137 non-null   object
 11  Country                                        6779 non-null   object
 12  Specialization                                 7802 non-null   object
 13  How did you hear about X Education             7033 non-null   object
 14  What is your current occupation                6550 non-null   object
 15  What matters most to you in choosing a course  6531 non-null   object
 16  Search                                         9240 non-null   object
 17  Magazine                                       9240 non-null   object
 18  Newspaper Article                              9240 non-null   object
 19  X Education Forums                             9240 non-null   object
 20  Newspaper                                      9240 non-null   object
 21  Digital Advertisement                          9240 non-null   object
 22  Through Recommendations                        9240 non-null   object
 23  Receive More Updates About Our Courses         9240 non-null   object
 24  Tags                                           5887 non-null   object
 25  Lead Quality                                   4473 non-null   object
 26  Update me on Supply Chain Content              9240 non-null   object
 27  Get updates on DM Content                      9240 non-null   object
 28  Lead Profile                                   6531 non-null   object
 29  City                                           7820 non-null   object
 30  Asymmetrique Activity Index                    5022 non-null   object
 31  Asymmetrique Profile Index                     5022 non-null   object
 32  Asymmetrique Activity Score                    5022 non-null   float64
 33  Asymmetrique Profile Score                     5022 non-null   float64
 34  I agree to pay the amount through cheque       9240 non-null   object
 35  A free copy of Mastering The Interview         9240 non-null   object
 36  Last Notable Activity                          9240 non-null   object
dtypes: float64(4), int64(3), object(30)
memory usage: 2.6+ MB
In [6]:
df.describe()
Out[6]:
| Lead Number | Converted | TotalVisits | Total Time Spent on Website | Page Views Per Visit | Asymmetrique Activity Score | Asymmetrique Profile Score | |
|---|---|---|---|---|---|---|---|
| count | 9240.000000 | 9240.000000 | 9103.000000 | 9240.000000 | 9103.000000 | 5022.000000 | 5022.000000 |
| mean | 617188.435606 | 0.385390 | 3.445238 | 487.698268 | 2.362820 | 14.306252 | 16.344883 |
| std | 23405.995698 | 0.486714 | 4.854853 | 548.021466 | 2.161418 | 1.386694 | 1.811395 |
| min | 579533.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 7.000000 | 11.000000 |
| 25% | 596484.500000 | 0.000000 | 1.000000 | 12.000000 | 1.000000 | 14.000000 | 15.000000 |
| 50% | 615479.000000 | 0.000000 | 3.000000 | 248.000000 | 2.000000 | 14.000000 | 16.000000 |
| 75% | 637387.250000 | 1.000000 | 5.000000 | 936.000000 | 3.000000 | 15.000000 | 18.000000 |
| max | 660737.000000 | 1.000000 | 251.000000 | 2272.000000 | 55.000000 | 18.000000 | 20.000000 |
Step-3: Data Preparation¶
1. Null Value Handling¶
In [7]:
# Remove duplicates
df.drop_duplicates(inplace=True) # inplace = True to make the changes permanent
In [8]:
# Check for the null values using isnull() function
df.isnull().sum()
Out[8]:
Prospect ID                                          0
Lead Number                                          0
Lead Origin                                          0
Lead Source                                         36
Do Not Email                                         0
Do Not Call                                          0
Converted                                            0
TotalVisits                                        137
Total Time Spent on Website                          0
Page Views Per Visit                               137
Last Activity                                      103
Country                                           2461
Specialization                                    1438
How did you hear about X Education                2207
What is your current occupation                   2690
What matters most to you in choosing a course     2709
Search                                               0
Magazine                                             0
Newspaper Article                                    0
X Education Forums                                   0
Newspaper                                            0
Digital Advertisement                                0
Through Recommendations                              0
Receive More Updates About Our Courses               0
Tags                                              3353
Lead Quality                                      4767
Update me on Supply Chain Content                    0
Get updates on DM Content                            0
Lead Profile                                      2709
City                                              1420
Asymmetrique Activity Index                       4218
Asymmetrique Profile Index                        4218
Asymmetrique Activity Score                       4218
Asymmetrique Profile Score                        4218
I agree to pay the amount through cheque             0
A free copy of Mastering The Interview               0
Last Notable Activity                                0
dtype: int64
In [9]:
# converting null values into percentages for better understanding
round((df.isnull().sum()/len(df)*100),2)
Out[9]:
Prospect ID                                       0.00
Lead Number                                       0.00
Lead Origin                                       0.00
Lead Source                                       0.39
Do Not Email                                      0.00
Do Not Call                                       0.00
Converted                                         0.00
TotalVisits                                       1.48
Total Time Spent on Website                       0.00
Page Views Per Visit                              1.48
Last Activity                                     1.11
Country                                          26.63
Specialization                                   15.56
How did you hear about X Education               23.89
What is your current occupation                  29.11
What matters most to you in choosing a course    29.32
Search                                            0.00
Magazine                                          0.00
Newspaper Article                                 0.00
X Education Forums                                0.00
Newspaper                                         0.00
Digital Advertisement                             0.00
Through Recommendations                           0.00
Receive More Updates About Our Courses            0.00
Tags                                             36.29
Lead Quality                                     51.59
Update me on Supply Chain Content                 0.00
Get updates on DM Content                         0.00
Lead Profile                                     29.32
City                                             15.37
Asymmetrique Activity Index                      45.65
Asymmetrique Profile Index                       45.65
Asymmetrique Activity Score                      45.65
Asymmetrique Profile Score                       45.65
I agree to pay the amount through cheque          0.00
A free copy of Mastering The Interview            0.00
Last Notable Activity                             0.00
dtype: float64
In [10]:
# Handle the 'Select' placeholder level (treat it as NaN)
df.replace('Select', np.nan, inplace=True)
In [11]:
# Drop columns with too many missing values [missing values >40%]
missing_ratio = df.isnull().mean()
cols_to_drop = missing_ratio[missing_ratio > 0.4].index
df.drop(columns = cols_to_drop,inplace=True)
In [12]:
df.shape
Out[12]:
(9240, 30)
In [13]:
# Handle null values in categorical columns (impute with the mode);
# assigning back avoids the chained-inplace pitfall on a column slice
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].fillna(df[col].mode()[0])
In [14]:
# Handle null values in numerical columns (impute with the median)
for col in df.select_dtypes(include=['int64', 'float64']).columns:
    df[col] = df[col].fillna(df[col].median())
In [15]:
df.isnull().sum()
Out[15]:
Prospect ID                                      0
Lead Number                                      0
Lead Origin                                      0
Lead Source                                      0
Do Not Email                                     0
Do Not Call                                      0
Converted                                        0
TotalVisits                                      0
Total Time Spent on Website                      0
Page Views Per Visit                             0
Last Activity                                    0
Country                                          0
Specialization                                   0
What is your current occupation                  0
What matters most to you in choosing a course    0
Search                                           0
Magazine                                         0
Newspaper Article                                0
X Education Forums                               0
Newspaper                                        0
Digital Advertisement                            0
Through Recommendations                          0
Receive More Updates About Our Courses           0
Tags                                             0
Update me on Supply Chain Content                0
Get updates on DM Content                        0
City                                             0
I agree to pay the amount through cheque         0
A free copy of Mastering The Interview           0
Last Notable Activity                            0
dtype: int64
All null values have now been handled; the dataset contains no missing values.¶
2. Outliers Handling¶
In [16]:
# Check for outliers in the dataset by inspecting the quantiles [.25, .5, .75, .90, .95, .99]
df.describe(percentiles=[.25, .5, .75, .90, .95, .99])
Out[16]:
| Lead Number | Converted | TotalVisits | Total Time Spent on Website | Page Views Per Visit | |
|---|---|---|---|---|---|
| count | 9240.000000 | 9240.000000 | 9240.000000 | 9240.000000 | 9240.000000 |
| mean | 617188.435606 | 0.385390 | 3.438636 | 487.698268 | 2.357440 |
| std | 23405.995698 | 0.486714 | 4.819024 | 548.021466 | 2.145781 |
| min | 579533.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 596484.500000 | 0.000000 | 1.000000 | 12.000000 | 1.000000 |
| 50% | 615479.000000 | 0.000000 | 3.000000 | 248.000000 | 2.000000 |
| 75% | 637387.250000 | 1.000000 | 5.000000 | 936.000000 | 3.000000 |
| 90% | 650506.100000 | 1.000000 | 7.000000 | 1380.000000 | 5.000000 |
| 95% | 655404.050000 | 1.000000 | 10.000000 | 1562.000000 | 6.000000 |
| 99% | 659592.980000 | 1.000000 | 17.000000 | 1840.610000 | 9.000000 |
| max | 660737.000000 | 1.000000 | 251.000000 | 2272.000000 | 55.000000 |
In [17]:
# Visualize the outliers in the numerical columns
num_col = df.select_dtypes(include=['int64', 'float64']).columns
for i in num_col:
    if i != 'Converted':
        plt.figure(figsize=(6, 4))
        sns.boxplot(df[i])
        plt.show()
In [18]:
# Check for outliers and handle them using the Inter-Quartile Range (IQR)
num_col = df.select_dtypes(include=['int64', 'float64']).columns
for i in num_col:
    Q1 = df[i].quantile(0.25)
    Q3 = df[i].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Cap the outliers at the IQR bounds
    df[i] = np.where(df[i] > upper_bound, upper_bound,
                     np.where(df[i] < lower_bound, lower_bound, df[i]))
In [19]:
df.shape
Out[19]:
(9240, 30)
In [20]:
# Visualize the numerical columns again after outlier handling
num_col = df.select_dtypes(include=['int64', 'float64']).columns
for i in num_col:
    if i != 'Converted':
        plt.figure(figsize=(6, 4))
        sns.boxplot(df[i])
        plt.show()
print("Outliers of the numerical columns are handled")
Outliers of the numerical columns are handled
3. Identify numerical and categorical columns in the dataset¶
In [21]:
numerical_col = df.select_dtypes(include = ['int64','float64']).columns
categorical_col = df.select_dtypes(include = ['object']).columns
numerical_col
Out[21]:
Index(['Lead Number', 'Converted', 'TotalVisits',
'Total Time Spent on Website', 'Page Views Per Visit'],
dtype='object')
In [22]:
df.head()
Out[22]:
| Prospect ID | Lead Number | Lead Origin | Lead Source | Do Not Email | Do Not Call | Converted | TotalVisits | Total Time Spent on Website | Page Views Per Visit | ... | Digital Advertisement | Through Recommendations | Receive More Updates About Our Courses | Tags | Update me on Supply Chain Content | Get updates on DM Content | City | I agree to pay the amount through cheque | A free copy of Mastering The Interview | Last Notable Activity | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7927b2df-8bba-4d29-b9a2-b6e0beafe620 | 660737.0 | API | Olark Chat | No | No | 0.0 | 0.0 | 0.0 | 0.0 | ... | No | No | No | Interested in other courses | No | No | Mumbai | No | No | Modified |
| 1 | 2a272436-5132-4136-86fa-dcc88c88f482 | 660728.0 | API | Organic Search | No | No | 0.0 | 5.0 | 674.0 | 2.5 | ... | No | No | No | Ringing | No | No | Mumbai | No | No | Email Opened |
| 2 | 8cc8c611-a219-4f35-ad23-fdfd2656bd8a | 660727.0 | Landing Page Submission | Direct Traffic | No | No | 1.0 | 2.0 | 1532.0 | 2.0 | ... | No | No | No | Will revert after reading the email | No | No | Mumbai | No | Yes | Email Opened |
| 3 | 0cc2df48-7cf4-4e39-9de9-19797f9b38cc | 660719.0 | Landing Page Submission | Direct Traffic | No | No | 0.0 | 1.0 | 305.0 | 1.0 | ... | No | No | No | Ringing | No | No | Mumbai | No | No | Modified |
| 4 | 3256f628-e534-4826-9d63-4a8b88782852 | 660681.0 | Landing Page Submission | No | No | 1.0 | 2.0 | 1428.0 | 1.0 | ... | No | No | No | Will revert after reading the email | No | No | Mumbai | No | No | Modified |
5 rows × 30 columns
Converting some binary variables (Yes/No) to 0/1¶
In [23]:
variable_list = ['Do Not Email', 'Do Not Call', 'Search', 'Magazine', 'Newspaper Article',
'X Education Forums', 'Newspaper', 'Digital Advertisement',
'Through Recommendations', 'Receive More Updates About Our Courses',
'I agree to pay the amount through cheque', 'A free copy of Mastering The Interview','Update me on Supply Chain Content','Get updates on DM Content']
In [24]:
# Creating a function binary_map to convert Yes/No columns to 1/0
def binary_map(x):
    return x.map({'Yes': 1, 'No': 0})

# Apply the function to the binary variable list
df[variable_list] = df[variable_list].apply(binary_map)
In [25]:
df.head()
Out[25]:
| Prospect ID | Lead Number | Lead Origin | Lead Source | Do Not Email | Do Not Call | Converted | TotalVisits | Total Time Spent on Website | Page Views Per Visit | ... | Digital Advertisement | Through Recommendations | Receive More Updates About Our Courses | Tags | Update me on Supply Chain Content | Get updates on DM Content | City | I agree to pay the amount through cheque | A free copy of Mastering The Interview | Last Notable Activity | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7927b2df-8bba-4d29-b9a2-b6e0beafe620 | 660737.0 | API | Olark Chat | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | Interested in other courses | 0 | 0 | Mumbai | 0 | 0 | Modified |
| 1 | 2a272436-5132-4136-86fa-dcc88c88f482 | 660728.0 | API | Organic Search | 0 | 0 | 0.0 | 5.0 | 674.0 | 2.5 | ... | 0 | 0 | 0 | Ringing | 0 | 0 | Mumbai | 0 | 0 | Email Opened |
| 2 | 8cc8c611-a219-4f35-ad23-fdfd2656bd8a | 660727.0 | Landing Page Submission | Direct Traffic | 0 | 0 | 1.0 | 2.0 | 1532.0 | 2.0 | ... | 0 | 0 | 0 | Will revert after reading the email | 0 | 0 | Mumbai | 0 | 1 | Email Opened |
| 3 | 0cc2df48-7cf4-4e39-9de9-19797f9b38cc | 660719.0 | Landing Page Submission | Direct Traffic | 0 | 0 | 0.0 | 1.0 | 305.0 | 1.0 | ... | 0 | 0 | 0 | Ringing | 0 | 0 | Mumbai | 0 | 0 | Modified |
| 4 | 3256f628-e534-4826-9d63-4a8b88782852 | 660681.0 | Landing Page Submission | 0 | 0 | 1.0 | 2.0 | 1428.0 | 1.0 | ... | 0 | 0 | 0 | Will revert after reading the email | 0 | 0 | Mumbai | 0 | 0 | Modified |
5 rows × 30 columns
In [26]:
# Drop 'Prospect ID' since it is a unique identifier with no predictive value
df.drop('Prospect ID', axis=1, inplace=True)
For categorical variables with multiple levels, create dummy features (one-hot encoded)¶
In [27]:
categorical_cols = df.select_dtypes(include='object').columns.tolist()
In [28]:
categorical_cols
Out[28]:
['Lead Origin', 'Lead Source', 'Last Activity', 'Country', 'Specialization', 'What is your current occupation', 'What matters most to you in choosing a course', 'Tags', 'City', 'Last Notable Activity']
In [29]:
# create dummy variables for the categorical columns
dummy_variables = pd.get_dummies(df[categorical_cols], drop_first= True).astype(int)
In [30]:
# Merge the dummy variables with the original dataset
df = pd.concat([df,dummy_variables],axis=1)
df.shape
Out[30]:
(9240, 175)
In [31]:
# Drop the categorical_cols from the data as we have created dummy variables for them
df.drop(categorical_cols, axis =1,inplace = True)
In [32]:
df.shape
Out[32]:
(9240, 165)
In [33]:
df.head()
Out[33]:
| Lead Number | Do Not Email | Do Not Call | Converted | TotalVisits | Total Time Spent on Website | Page Views Per Visit | Search | Magazine | Newspaper Article | ... | Last Notable Activity_Form Submitted on Website | Last Notable Activity_Had a Phone Conversation | Last Notable Activity_Modified | Last Notable Activity_Olark Chat Conversation | Last Notable Activity_Page Visited on Website | Last Notable Activity_Resubscribed to emails | Last Notable Activity_SMS Sent | Last Notable Activity_Unreachable | Last Notable Activity_Unsubscribed | Last Notable Activity_View in browser link Clicked | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 660737.0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 660728.0 | 0 | 0 | 0.0 | 5.0 | 674.0 | 2.5 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 660727.0 | 0 | 0 | 1.0 | 2.0 | 1532.0 | 2.0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 660719.0 | 0 | 0 | 0.0 | 1.0 | 305.0 | 1.0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 660681.0 | 0 | 0 | 1.0 | 2.0 | 1428.0 | 1.0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 165 columns
Checking the conversion rate¶
In [34]:
lead_conversion_rate = df['Converted'].mean() * 100
lead_conversion_rate
Out[34]:
38.53896103896104
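The ~38.5% base conversion rate means the classes are moderately imbalanced, so it is worth preserving this ratio when splitting the data for modelling. A minimal sketch (on synthetic data, not the actual `df`) using the already-imported `train_test_split` with `stratify`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target with roughly the dataset's imbalance (~38.5% positives)
y = np.array([1] * 385 + [0] * 615)
X = np.arange(len(y)).reshape(-1, 1)  # dummy feature matrix

# stratify=y keeps the class ratio (almost) identical in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(round(y_train.mean(), 3), round(y_test.mean(), 3))  # both close to 0.385
```

Without `stratify`, a random split could by chance over- or under-represent converters in the test set and distort the evaluation metrics.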
Step-4 : Looking at correlations¶
In [46]:
# Correlation matrix using heatmap
# (annotations turned off: with 165 columns, per-cell labels are unreadable)
plt.figure(figsize=(60, 40))
sns.heatmap(df.corr(), annot=False)
plt.show()
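With 165 columns the full heatmap is hard to read, so a useful complement is to rank features by the absolute value of their correlation with `Converted`. A sketch on a small synthetic frame (standing in for the prepared `df`, which is not reproduced here):

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for the prepared dataset:
# one column carries real signal, one is pure noise
rng = np.random.default_rng(0)
demo = pd.DataFrame({'Converted': rng.integers(0, 2, 200)})
demo['Total Time Spent on Website'] = demo['Converted'] * 500 + rng.normal(0, 100, 200)
demo['TotalVisits'] = rng.integers(0, 10, 200)

# Rank features by absolute correlation with the target (target itself excluded)
top_corr = (demo.corr()['Converted']
                .drop('Converted')
                .abs()
                .sort_values(ascending=False))
print(top_corr.index[0])  # the engineered signal column ranks first
```

On the real `df`, the head of this ranking gives a quick shortlist of candidate predictors before formal feature selection with RFE.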